Abstract

We consider using an ensemble of binary classifiers for transductive prediction, when unlabeled test 
data are known in advance. We derive minimax optimal rules for confidence-rated prediction in this 
setting. By using PAC-Bayes analysis on these rules, we obtain data-dependent performance guarantees 
without distributional assumptions on the data. Our analysis techniques are readily extended to a setting 
in which the predictor is allowed to abstain. 


1 Introduction 

Modern applications of binary classification have recently driven renewed theoretical interest in the problem
of confidence-rated prediction. This concerns classifiers which, for any unlabeled data example, output
an encoding of the classifier's confidence in its own label prediction. Confidence values can subsequently
be useful for active learning or further post-processing.

Approaches which poll the predictions of many classifiers in an ensemble $\mathcal{H}$ are of particular interest for
this problem. The Gibbs (averaged) classifier chooses a random rule from the ensemble and predicts
with that rule. Equivalently, the prediction on a particular unlabeled example is randomly
$+1$ or $-1$, with probabilities proportional to the number of votes garnered by the corresponding labels. This
is intuitively appealing, but it ignores an important piece of information: the average error of the Gibbs
predictor, which we denote by $\lambda$. If the ratio between the $+1$ and $-1$ votes is more extreme than $\lambda$, then
the intuition is that the algorithm should be fairly confident in the majority prediction. The main result of
this paper is a proof that a slight variation of this rough argument holds true, suggesting ways to aggregate
the classifiers in $\mathcal{H}$ when the Gibbs predictor is not optimal.

We consider a simple transductive prediction model in which the label predictor and nature are seen as
opponents playing a zero-sum game. In this game, the predictor chooses a prediction $g_i \in [-1,1]$ on the
$i$th unlabeled example, and nature chooses a label $z_i \in [-1,1]$. The goal of the predictor is to maximize the
average correlation $\frac{1}{n} \sum_{i=1}^{n} g_i z_i$ over $n$ unlabeled examples, while nature plays to minimize this
correlation.

Without additional constraints, nature could use the trivial strategy of always choosing $z_i = 0$, in which
case the correlation would be zero regardless of the choices made by the predictor. Therefore, we make one
assumption: that the predictor has access to an ensemble of classifiers which on average have small error.
Clearly, under this condition nature cannot use the trivial strategy. The central question is then: What is
the optimal way for the predictor to combine the predictions of the ensemble? That question motivates the
main contributions of this paper:

• Identifying the minimax optimal strategies for the predictor and for nature, and the resulting minimax 
value of the game. 

^1 This can be thought of as parametrizing a stochastic binary label; for instance, $z_i = -0.5$ would be equivalent to choosing
the labels $(-1, 1)$ with respective probabilities $(0.75, 0.25)$.

• Applying the minimax analysis to the PAC-Bayesian framework to derive PAC-style guarantees. We show
that the minimax predictor cannot do worse than the average ensemble prediction, and quantify situations 
in which it enjoys better performance guarantees. 

• Extending the analysis to the case in which the predictor can abstain from committing to any label and 
instead suffer a fixed loss. A straightforward modification of the earlier minimax analysis expands on prior 
work in this setting. 


2 Preliminaries 


The scenario we have outlined can be formalized with the following definitions. 


1. Classifier ensemble: A finite set of classification rules $\mathcal{H} = \{h_1, \dots, h_H\}$ that map examples $x \in \mathcal{X}$
to labels $y \in \mathcal{Y} := \{-1, 1\}$. This is given to the predictor, with a distribution $q$ over $\mathcal{H}$.

2. Test set: $n$ unlabeled examples $x_i \in \mathcal{X}$, denoted $T = \{x_1, \dots, x_n\}$.

3. Nature: Nature chooses a vector $z \in [-1,1]^n$ encoding the label associated with each test example.
This information is unknown to the predictor.


4. Low average error: Recall the distribution $q$ over the ensemble given to the predictor. We assume
that for some $\lambda > 0$, $\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{H} q_j h_j(x_i) z_i \geq \lambda$. So the average correlation between the prediction of
a randomly chosen classifier from the ensemble and the true label of a random example from $T$ is at
least $\lambda$.


5. Notation: For convenience, we denote by $F$ the matrix that contains the predictions of $(h_1, \dots, h_H)$
on the examples $(x_1, \dots, x_n)$. $F$ is independent of the true labels, and is fully known to the predictor.

$$F = \begin{pmatrix} h_1(x_1) & h_2(x_1) & \cdots & h_H(x_1) \\ h_1(x_2) & h_2(x_2) & \cdots & h_H(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_1(x_n) & h_2(x_n) & \cdots & h_H(x_n) \end{pmatrix}$$

In this notation the bound on the average error is expressed as:

$$\frac{1}{n} z^\top F q \geq \lambda. \tag{1}$$
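The notation above can be made concrete with a small sketch. The function names and the toy matrix are hypothetical, chosen only for illustration; the code computes the ensemble prediction vector $Fq$ and the left-hand side of constraint (1) for one possible choice of labels by nature.

```python
def ensemble_predictions(F, q):
    """Return F q: the q-weighted average vote on each test example."""
    return [sum(q_j * f_ij for q_j, f_ij in zip(q, row)) for row in F]


def average_correlation(F, q, z):
    """Return (1/n) z^T F q, the left-hand side of constraint (1)."""
    a = ensemble_predictions(F, q)
    return sum(z_i * a_i for z_i, a_i in zip(z, a)) / len(a)


# Hypothetical toy data: n = 3 examples (rows), H = 2 classifiers (columns).
F = [[1, 1],
     [1, -1],
     [-1, -1]]
q = [0.5, 0.5]          # uniform weights over the ensemble
z = [1.0, 1.0, -1.0]    # one label vector that nature might choose

a = ensemble_predictions(F, q)       # the two classifiers disagree on example 2
corr = average_correlation(F, q, z)  # nature's z is feasible iff corr >= lambda
```

With these toy values the ensemble is unanimous on the first and third examples and split on the second, so the second entry of `a` is zero.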


3 The Confidence-Rated Prediction Game 


In this game, the goal of the predictor is to find a function $g : T \to [-1,1]^n$ (a vector in $\mathbb{R}^n$), so that each
example $x_i$ maps to a confidence-rated label prediction $g_i \in [-1,1]$. The predictor maximizes the worst-case
correlation with the true labels $z^\top g$, predicting with the solution $g^*$ to the following game:

$$\text{Find: } \max_{g} \min_{z} \; \frac{1}{n} z^\top g \qquad \text{Such that: } \frac{1}{n} z^\top a \geq \lambda, \quad -\mathbf{1}^n \leq z \leq \mathbf{1}^n, \quad -\mathbf{1}^n \leq g \leq \mathbf{1}^n. \tag{2}$$

^2 We could also more generally assume we are given a distribution $r$ over $T$ which unequally weights the points in $T$. The
arguments used in the analysis remain unchanged in that case.

^3 Equivalent to the average classification error being $\leq \frac{1}{2}(1 - \lambda)$.
where $a := Fq \in \mathbb{R}^n$ represents the ensemble predictions on the dataset:

$$a = \left( \sum_{j=1}^{H} q_j h_j(x_1), \; \sum_{j=1}^{H} q_j h_j(x_2), \; \dots, \; \sum_{j=1}^{H} q_j h_j(x_n) \right)^\top.$$

It is immediate from the formulation (2) that the predictor can simply play $g = a$, which will guarantee
correlation $\lambda$ due to the average performance constraint. This is the prediction made by the Gibbs classifier
$h_q(x) = \sum_{j=1}^{H} q_j h_j(x)$, which averages across the ensemble. We now identify the optimal strategy for the
predictor and show when and by how much it outperforms $h_q$.


3.1 Analytical Solution 

Consider the game as defined in (2). The conditions for minimax duality hold here (see Prop. 9 in the appendices
for completeness), so the minimax dual game is

$$\min_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \max_{g \in [-1,1]^n} \; \frac{1}{n} z^\top g. \tag{3}$$


Without loss of generality, we can reorder the examples so that the following condition holds on the
vector of ensemble predictions $a$.

Ordering 1. Order the examples so that $|a_1| \geq |a_2| \geq \dots \geq |a_n|$. □

Our first result expresses the minimax value of this game. (All proofs are deferred to the appendices.)

Lemma 1. Using Ordering 1 of the examples, let $v = \min \left\{ i \in [n] : \frac{1}{n} \sum_{j=1}^{i} |a_j| \geq \lambda \right\}$. Then the value of
the game (2) is

$$V := \frac{v-1}{n} + \frac{1}{|a_v|} \left( \lambda - \frac{1}{n} \sum_{i=1}^{v-1} |a_i| \right).$$


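This allows us to verify the minimax optimal strategies.

Theorem 2. Suppose the examples are in Ordering 1 and let $v$ be as defined in Lemma 1. The minimax
optimal strategies for the predictor ($g^*$) and nature ($z^*$) in the game (2) are:

$$g_i^* = \begin{cases} \operatorname{sgn}(a_i) & i < v \\[4pt] \dfrac{a_i}{|a_v|} & i \geq v \end{cases} \qquad\qquad z_i^* = \begin{cases} \operatorname{sgn}(a_i) & i < v \\[4pt] \dfrac{1}{a_v} \left( n\lambda - \sum_{j=1}^{v-1} |a_j| \right) & i = v \\[4pt] 0 & i > v \end{cases}$$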
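The strategies of Theorem 2 are simple to compute once the examples are sorted. The following is a minimal sketch, not a reference implementation: the function name `minimax_strategies` is hypothetical, the inputs are assumed to satisfy $0 < \lambda \leq \frac{1}{n}\sum_i |a_i|$ so that nature's constraint is feasible, and ties in $|a_i|$ are broken arbitrarily by the sort.

```python
def minimax_strategies(a, lam):
    """Return (g_star, z_star, V) following Lemma 1 and Theorem 2.

    Examples are sorted internally by decreasing |a_i| (Ordering 1);
    the returned vectors are in the original input order.
    """
    n = len(a)
    if not 0 < lam <= sum(abs(x) for x in a) / n:
        raise ValueError("need 0 < lam <= (1/n) * sum |a_i|")
    order = sorted(range(n), key=lambda i: -abs(a[i]))
    # k is the 0-based sorted position of the pivot example v (v = k + 1);
    # partial accumulates |a_i| over the k examples sorted before it.
    k, partial = 0, 0.0
    while k < n - 1 and partial + abs(a[order[k]]) < n * lam:
        partial += abs(a[order[k]])
        k += 1
    pivot = order[k]
    g_star, z_star = [0.0] * n, [0.0] * n
    for pos, i in enumerate(order):
        if pos < k:
            sign = float((a[i] > 0) - (a[i] < 0))
            g_star[i] = z_star[i] = sign          # case i < v
        else:
            g_star[i] = a[i] / abs(a[pivot])      # case i >= v
    z_star[pivot] = (n * lam - partial) / a[pivot]  # case i = v
    V = k / n + (lam - partial / n) / abs(a[pivot])
    return g_star, z_star, V


# Hypothetical example with n = 4 and lambda = 0.3.
g_star, z_star, V = minimax_strategies([1.0, -0.5, 0.25, 0.0], 0.3)
```

In this example the pivot is the second example, so the predictor votes with full confidence on the first two examples and scales the remaining ensemble predictions by $1/|a_v| = 2$.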


3.2 Discussion 

As shown in Fig. 1, the optimal strategy $g^*$ depends on the Gibbs classifier's prediction $a$ in an elegant
way. For the fraction of the points on which the ensemble is most certain (indices $i < v$), the minimax
optimal prediction is the deterministic majority vote.

For any example $i$, $g_i^*$ is a nondecreasing function of $a_i$, which is often an assumption on $g$ in similar
contexts [5]. In our case, this monotonic behavior of $g^*$ arises entirely from nature's average error constraint;
$g$ itself is only constrained to be in a hypercube.

^4 If every $h \in \mathcal{H}$ makes identical predictions on the dataset and has correlation $\lambda$ with the true labels, then the ensemble
effectively has just one element, so outperforming it is impossible without outside information.

When $\lambda \approx 0$, $g^*$ approximates the ensemble prediction. But as the ensemble's average correlation
$\lambda$ increases, the predictor is able to act with certainty ($|g_i^*| = 1$) on a growing number of examples, even
when the vote is uncertain ($|a_i| < 1$).

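The value of the game can be written as

$$V = \frac{v-1}{n} + \frac{1}{|a_v|} \left( \lambda - \frac{1}{n} \sum_{i=1}^{v-1} |a_i| \right) \geq \frac{v-1}{n} + \lambda - \frac{1}{n} \sum_{i=1}^{v-1} |a_i| = \lambda + \frac{1}{n} \sum_{i=1}^{v-1} \left( 1 - |a_i| \right). \tag{4}$$

The inequality holds because $|a_v| \leq 1$ and, by the definition of $v$, the term in parentheses is nonnegative.
This shows that predicting with $g^*$ cannot hurt performance relative to the average ensemble prediction,
and indeed will help when there are disagreements in the ensemble on high-margin examples. The difference
$V - \lambda \geq \frac{1}{n} \sum_{i=1}^{v-1} (1 - |a_i|)$ quantifies the benefit of our prediction rule's voting behavior as opposed to the Gibbs
classifier's averaged prediction.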

Our minimax analysis is able to capture this uniquely vote-based behavior because of the transductive 
setting. The relationships between the classifiers in the ensemble determine the performance of the vote, 
and analyzing such relationships is much easier in the transductive setting. There is a dearth of applications 
of this insight in existing literature on confidence-rated prediction, with [7] being a notable exception. 

In this work, the predictor obtains a crude knowledge of the hypothesis predictions F through ensemble 
predictions a, and is thereby able to quantify the benefit of a vote-based predictor in terms of a. However, 
the loss of information in compressing F into a is unnecessary, as F is fully known to the predictor in a 
transductive setting. It would therefore be interesting (and provide a tighter analysis) to further incorporate 
the structure of F into the prediction game in future work. 


4 A PAC-Bayes Analysis of a Transductive Prediction Rule 

The minimax predictor we have described relies on a known average correlation $\lambda > 0$ between the ensemble
predictions and true labels. In this section, we consider a simple transductive statistical learning scenario
in which the ensemble distribution $q$ is learned from a training set using a PAC-Bayesian criterion, giving a
statistical learning algorithm with a PAC-style analysis.

Suppose we are in a transductive setting with a training set $S$ with known labels, and a test set $T$ with
unknown labels. $S$ and $T$ are assumed to be composed of labeled examples drawn i.i.d. from a distribution
$\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. We write $|S| = m$ and consider $|T| > m$.

Denote the true error of a hypothesis $h \in \mathcal{H}$ by $\mathrm{err}(h) = \Pr_{(x,y) \sim \mathcal{D}}(h(x) \neq y)$, its empirical error on $S$ by
$\widehat{\mathrm{err}}_S(h) = \frac{1}{m} \sum_{(x,y) \in S} \mathbf{1}(h(x) \neq y)$, and its empirical error on $T$ by $\widehat{\mathrm{err}}_T(h)$. Also, for any $p, q \in [0,1]$
define $\mathrm{KL}(p \,\|\, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1-p}{1-q}$, and otherwise define the KL divergence $\mathrm{KL}(p \,\|\, q)$ between
two distributions $p, q \in \mathbb{R}^n$ in the usual way as $\sum_{i} p_i \log \frac{p_i}{q_i}$.
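As a small illustration of the two divergences just defined, here is a sketch in plain Python. The function names are hypothetical, and the convention $0 \log 0 = 0$ is an assumption made so that point-mass distributions can be handled.

```python
import math


def binary_kl(p, q):
    """KL(p || q) between Bernoulli(p) and Bernoulli(q), with 0 log 0 = 0."""
    def term(x, y):
        if x == 0.0:
            return 0.0
        if y == 0.0:
            return math.inf
        return x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def kl(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i) for discrete distributions."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                return math.inf
            total += p_i * math.log(p_i / q_i)
    return total
```

The scalar form is the vector form applied to the two-point distributions $(p, 1-p)$ and $(q, 1-q)$, which is why the same symbol is used for both.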



Figure 1: Optimal strategies $g^*, z^*$ for the game without abstention (Thm. 2), plotted against $a_i$.

Figure 2: Near-optimal abstain probabilities (Thm. 6), plotted against $a_i$.

Figure 3: An illustration of the minimax optimal predictor $g^*$, with the linear separators being low-error
hypotheses in $\mathcal{H}$, and the examples colored according to their true labels. The red and blue shaded
areas indicate where the predictor is certain.


Finally, define e(TO, q, qo,5) ■= y ^ yKL{c[ || qo) + log j j and the error-minimizing hypothesis 

$h^* = \arg\min_{h \in \mathcal{H}} \mathrm{err}(h)$.

4.1 An Algorithm with a PAC-Bayes Analysis 

Algorithm: The learning algorithm we consider simply observes the labeled set $S$ and chooses a distribution
$q$ over $\mathcal{H}$ that has low error on $S$. Based on this, it calculates a lower bound $\lambda$ (Eq. (6)) on the correlation
with $T$ of the associated Gibbs classifier. Finally, it uses $\lambda$ in the average error constraint for the game of
Section 3 and predicts with the corresponding minimax optimal strategy $g^*$ as given in Theorem 2.

Analysis: We begin analyzing this algorithm by applying the PAC-Bayes theorem ([8]) to control
$\mathbb{E}_{h \sim q}[\mathrm{err}(h)]$, which immediately yields the following.

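Lemma 3. Choose any prior distribution $q_0$ over $\mathcal{H}$. With probability $\geq 1 - \delta$ over the choice of training
set $S$, for all distributions $q$ over $\mathcal{H}$ simultaneously,

$$\mathrm{KL}\left( \mathbb{E}_{h \sim q}[\widehat{\mathrm{err}}_S(h)] \,\Big\|\, \mathbb{E}_{h \sim q}[\mathrm{err}(h)] \right) \tag{5}$$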

<;t(/rL(q||q,) + log(n^)) 

This can be easily converted into a bound on $\widehat{\mathrm{err}}_T(h)$, using a Hoeffding bound and the well-known
inequality $\mathrm{KL}(p \,\|\, q) \geq 2(p - q)^2$:

Theorem 4. Choose any prior distribution $q_0$ over the hypotheses $\mathcal{H}$. With probability $\geq 1 - \delta$, for all
distributions $q$ over $\mathcal{H}$ simultaneously,

$$\mathbb{E}_{h \sim q}[\widehat{\mathrm{err}}_T(h)] \leq \mathbb{E}_{h \sim q}[\widehat{\mathrm{err}}_S(h)] + \epsilon(m, q, q_0, \delta).$$

Theorem 4 w.h.p. controls the average classifier performance on the test set. Recall that this constrains
nature in the minimax analysis, where it is expressed as a lower bound $\lambda$ on the average correlation
$1 - 2\mathbb{E}_q[\widehat{\mathrm{err}}_T(h)]$. From Theorem 4, with probability $\geq 1 - \delta$, for all $q$,

$$\lambda = 1 - 2\mathbb{E}_q[\widehat{\mathrm{err}}_S(h)] - 2\epsilon(m, q, q_0, \delta). \tag{6}$$

Inside this high-probability event, the scenario of Section 3 (the prediction game) holds, with $\lambda$ given by
(6). A higher correlation bound $\lambda$ leads to better performance.
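Eq. (6) is a one-line computation once the Gibbs training error is known. The sketch below uses a hypothetical function name and treats $\epsilon(m, q, q_0, \delta)$ as a number supplied by the caller, since it depends on the training set size, the prior, and the confidence level.

```python
def correlation_lower_bound(q, train_errors, eps):
    """Return lambda = 1 - 2 E_q[err_S(h)] - 2 eps, as in Eq. (6).

    q            -- weights over the ensemble (nonnegative, summing to 1)
    train_errors -- empirical error of each hypothesis on S, same order as q
    eps          -- the value of epsilon(m, q, q0, delta), precomputed
    """
    gibbs_train_error = sum(q_h * e_h for q_h, e_h in zip(q, train_errors))
    return 1.0 - 2.0 * gibbs_train_error - 2.0 * eps


# Hypothetical numbers: two hypotheses with training errors 10% and 20%.
lam = correlation_lower_bound([0.5, 0.5], [0.1, 0.2], eps=0.05)
```

The bound is only useful in the game when it is positive, which requires the Gibbs training error plus $\epsilon$ to be below $\frac{1}{2}$.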

In the game of Section 3, values in $[-1,1]$ can be thought of as parametrizing stochastic binary labels/predictions
(see Footnote 1). By this token, if the value of the game is $V$, the prediction algorithm is
incorrect with probability at most $\frac{1}{2}(1 - V) \leq \frac{1}{2}\left(1 - \lambda - \frac{1}{n} \sum_{i=1}^{v-1} (1 - |a_i|)\right)$ (from (4)). Combining this
with (6) and a union bound, the algorithm's probability of error is at most

$$\mathbb{E}_q[\widehat{\mathrm{err}}_S(h)] + \epsilon(m, q, q_0, \delta) + \delta - \frac{1}{2n} \sum_{i=1}^{v-1} (1 - |a_i|). \tag{7}$$

Since (7) holds uniformly over $q$, the training set $S$ can be used to set $q$ to minimize $\mathbb{E}_q[\widehat{\mathrm{err}}_S(h)]$. However,
naively choosing a point distribution on $\arg\min_h \widehat{\mathrm{err}}_S(h)$ (performing empirical risk minimization) is inadvisable,
because it does not hedge against the sample randomness. In technical terms, a point distribution
leads to a high KL divergence term (using a uniform prior $q_0$) in (7), and of course eliminates any potential
benefit from voting with respect to the Gibbs classifier. Instead, a higher-entropy distribution is apropos,
like the exponential-weights distribution $q_{\exp}(h) \propto \exp(-\eta \, \widehat{\mathrm{err}}_S(h))$.

Regardless, for sufficiently high $m$, $\epsilon(m, q, q_0, \delta)$ is negligible and we can take $\mathbb{E}_q[\widehat{\mathrm{err}}_S(h)] \approx \mathrm{err}(h^*)$. So in
this regime the classification error is $\leq \mathrm{err}(h^*)$; it can be much lower due to the final term of (7), again
highlighting the benefit of voting behavior when there is ensemble disagreement.
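The trade-off between a point distribution and a higher-entropy one can be seen numerically. The following sketch is illustrative only: the function names, the training errors, and the choice $\eta = 10$ are all assumptions. It compares the KL divergence to a uniform prior for an empirical-risk-minimizing point mass against that of the exponential-weights distribution.

```python
import math


def exponential_weights(train_errors, eta):
    """Return q_exp with q_exp(h) proportional to exp(-eta * err_S(h))."""
    weights = [math.exp(-eta * e) for e in train_errors]
    total = sum(weights)
    return [w / total for w in weights]


def kl_to_uniform(q):
    """KL(q || uniform) = log|H| + sum_h q(h) log q(h), with 0 log 0 = 0."""
    return math.log(len(q)) + sum(q_h * math.log(q_h) for q_h in q if q_h > 0)


train_errors = [0.10, 0.12, 0.11, 0.40]   # hypothetical empirical errors
q_exp = exponential_weights(train_errors, eta=10.0)
q_erm = [1.0, 0.0, 0.0, 0.0]              # point mass on the best hypothesis

kl_exp = kl_to_uniform(q_exp)
kl_erm = kl_to_uniform(q_erm)             # equals log 4, the largest possible
```

Here the exponential-weights distribution spreads its mass over the three comparably good hypotheses, so its KL term is much smaller than the point mass's $\log 4$, at the cost of a slightly higher Gibbs training error.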

Many Good Classifiers and Infinite $\mathcal{H}$. The guarantees of this section are particularly nontrivial if there
is a significant ($q$-)fraction of hypotheses in $\mathcal{H}$ with low error. The uniform distribution over these good
hypotheses (call it $U_g$) has support of a significant size, and therefore $\mathrm{KL}(U_g \,\|\, q_0)$ is low, where $q_0$ is a
uniform prior over $\mathcal{H}$. (The same approach extends to infinite hypothesis classes.)

4.2 Discussion 

Our approach to generating confidence-rated predictions has two distinct parts: the PAC-Bayes analysis and 
then the minimax analysis. This explicitly decouples the selection of q from the aggregation of ensemble 
predictions, and makes clear the sources of robustness here: PAC-Bayes works with any data distribution 
and any posterior q, and the minimax scenario works with any q and yields worst-case guarantees. It also 
means that either of the parts is substitutable. 

The PAC-Bayes theorem itself admits improvements which would tighten the results achieved by our
approach. Two notable ones are the use of generic chaining to incorporate finer-grained complexity of $\mathcal{H}$
[9], and an extension to the transductive online setting of sampling without replacement, in which no i.i.d.
generative assumption is required [10].

5 Extension to Abstention 

This section outlines a natural extension of the previous binary classification game, in which the predictor
can choose to abstain from committing to a label and suffer a fixed loss instead. We model the impact of
abstaining by treating it as a third classification outcome with a relative cost of $\alpha > 0$, where $\alpha$ is a constant
independent of the true labels. Concretely, consider the following general modification of our earlier game
with parameter $\alpha > 0$:

1. Predictor: On an example $i \in [n]$, the predictor can either predict a value in $[-1,1]$ or abstain
(denoted by an output of $\perp$). When it does not abstain, it predicts $g_i \in [-1,1]$ as in the previous
game. But it does so only with probability $1 - p_i^a$ for some $p_i^a \in [0,1]$; the rest of the time it abstains,
i.e. $\Pr(\text{output } \perp \text{ on } i) = 1 - \Pr(\text{predict } g_i \text{ on } i) = p_i^a$. So the predictor's strategy in the game is a
choice of $(g, p^a)$, where $g = (g_1, \dots, g_n)^\top \in [-1,1]^n$ and $p^a = (p_1^a, \dots, p_n^a)^\top \in [0,1]^n$.

2. Nature: Nature chooses $z$ as before to represent its randomized label choices for the data.

3. Cost model: The predictor suffers cost (nature's gain) of the form $\frac{1}{n} \sum_{i=1}^{n} \ell(z_i, \hat{z}_i)$, where $\hat{z}_i \in \{g_i, \perp\}$
is the predictor's output. The cost function $\ell(\cdot, \cdot)$ incorporates abstention using a constant loss $\alpha > 0$
if $\hat{z}_i = \perp$, regardless of $z_i$:

$$\ell(z_i, \hat{z}_i) = \begin{cases} \frac{1}{2}(1 - g_i z_i) & \hat{z}_i = g_i \\ \alpha & \hat{z}_i = \perp \end{cases} \tag{8}$$
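Taking the expectation of the cost (8) over the predictor's randomized decision to abstain gives a per-example loss of $p_i^a \alpha + \frac{1}{2}(1 - p_i^a)(1 - g_i z_i)$. A minimal sketch of this computation follows; the function name and the toy inputs are hypothetical.

```python
def expected_abstain_loss(g, p_abstain, z, alpha):
    """Return (1/n) sum_i [ p_i * alpha + (1 - p_i) * (1 - g_i z_i) / 2 ].

    g         -- predictions in [-1, 1], used when not abstaining
    p_abstain -- abstain probabilities in [0, 1]
    z         -- nature's labels in [-1, 1]
    alpha     -- fixed cost of abstaining
    """
    total = 0.0
    for g_i, p_i, z_i in zip(g, p_abstain, z):
        total += p_i * alpha + (1.0 - p_i) * 0.5 * (1.0 - g_i * z_i)
    return total / len(g)


# Hypothetical two-example game with abstain cost 0.2.
loss = expected_abstain_loss([1.0, 0.5], [0.0, 0.5], [1.0, 0.0], alpha=0.2)
always_abstain = expected_abstain_loss([1.0, 0.5], [1.0, 1.0], [1.0, 0.0], alpha=0.2)
```

Abstaining on every example always costs exactly $\alpha$, whatever nature plays, which is the baseline the rest of this section compares against.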


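In this game (the "abstain game"), the predictor wishes to minimize the expected loss w.r.t. the stochastic
strategies of itself and nature, and nature plays to maximize this loss. So the game can be formulated as:

$$\min_{\substack{g \in [-1,1]^n, \\ p^a \in [0,1]^n}} \; \max_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \frac{1}{n} \sum_{i=1}^{n} \left[ p_i^a \alpha + \frac{1}{2}(1 - p_i^a)(1 - g_i z_i) \right]$$

$$= \frac{1}{2} + \min_{p^a \in [0,1]^n} \left[ \frac{1}{n} \sum_{i=1}^{n} \left( \alpha - \frac{1}{2} \right) p_i^a - \frac{1}{2} \max_{g \in [-1,1]^n} \min_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \frac{1}{n} \sum_{i=1}^{n} z_i (1 - p_i^a) g_i \right] \tag{9}$$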


5.1 Value of the Abstain Game 

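Minimax duality does apply to (9), and the dual game is easier to work with. Calculating its value leads to
the following result.

Theorem 5. The value $V_{\mathrm{abst}}$ of the game (9) is as follows for $\alpha < \frac{1}{2}$.

1. If $\alpha \leq \frac{1}{2} \left( 1 - \frac{n \lambda}{\sum_{i=1}^{n} |a_i|} \right)$, then $V_{\mathrm{abst}} = \alpha$ (and the game is vacuous with the minimax optimal strategy
$p^{a*} = \mathbf{1}^n$).

2. If $\alpha > \frac{1}{2} \left( 1 - \frac{n \lambda}{\sum_{i=1}^{n} |a_i|} \right)$, then the value is nontrivial and can be bounded. Using Ordering 1 of the
examples, let $w = \min \left\{ i \in [n] : \frac{1}{n} \left( \sum_{j=1}^{i} |a_j| + (1 - 2\alpha) \sum_{j=i+1}^{n} |a_j| \right) \geq \lambda \right\}$. Then $\alpha \left( 1 - \frac{w}{n} \right) \leq V_{\mathrm{abst}} \leq \alpha \left( 1 - \frac{w-1}{n} \right)$.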

The first part of this result implies that if there is a low abstain cost $\alpha > 0$, a low enough average
correlation $\lambda$, and not many disagreements among the ensemble, then it is best to simply abstain a.s. on all
points. This is intuitively appealing, but it appears to be new to the literature.

Though the value $V_{\mathrm{abst}}$ is obtainable as above, we find solving for the optimal strategies in closed form to
be more challenging. As we are primarily interested in the learning problem and therefore tractable strategies
for the abstain game, we abandon the minimax optimal approach and present a simple near-optimal strategy
for the algorithm. This strategy has several favorable properties which facilitate comparison with the rest
of the paper and with prior work.


5.2 Near-Optimal Strategy for the Abstain Game 

The following is our main result for the abstain game, derived in Section 5.4.

Theorem 6. Using Ordering 1 of the examples, define $v$ as in Lemma 1. Let $g^*$ be the minimax optimal
strategy for the predictor in the binary classification game without abstentions, as described in Theorem 2.
Suppose the predictor in the abstain game (9) plays $(g^*, p^{a,\mathrm{alg}})$ respectively, where

$$p_i^{a,\mathrm{alg}} = \begin{cases} 1 - |g_i^*| & \alpha < \frac{1}{2} \\ 0 & \alpha \geq \frac{1}{2} \end{cases}$$
1. The worst-case loss incurred by the predictor is at most - (l -) when a > and a (l — -) H-

2 V n/ ^ \ n/ 

(5-“) i. 

2. Nature can play z* to induce this worst-case loss. 
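The strategy in Theorem 6 can be sketched in a few lines, assuming the rule $p_i^{a,\mathrm{alg}} = 1 - |g_i^*|$ for $\alpha < \frac{1}{2}$ and $0$ otherwise. The function name is hypothetical, and the code relies on the observation that Theorem 2's $g^*$ equals $a_i / |a_v|$ clipped to $[-1, 1]$, since $|a_i| \geq |a_v|$ exactly for the examples sorted before the pivot.

```python
def near_optimal_abstain_strategy(a, lam, alpha):
    """Return (g_star, p_alg) for ensemble predictions a, bound lam, cost alpha."""
    n = len(a)
    if not 0 < lam <= sum(abs(x) for x in a) / n:
        raise ValueError("need 0 < lam <= (1/n) * sum |a_i|")
    order = sorted(range(n), key=lambda i: -abs(a[i]))
    # Locate the pivot example v of Lemma 1 (0-based sorted position k).
    k, partial = 0, 0.0
    while k < n - 1 and partial + abs(a[order[k]]) < n * lam:
        partial += abs(a[order[k]])
        k += 1
    a_v = abs(a[order[k]])
    # g*: sgn(a_i) before the pivot, a_i / |a_v| from the pivot onwards.
    g_star = [max(-1.0, min(1.0, a_i / a_v)) for a_i in a]
    if alpha < 0.5:
        p_alg = [1.0 - abs(g_i) for g_i in g_star]   # abstain where unsure
    else:
        p_alg = [0.0] * n                            # abstaining never pays
    return g_star, p_alg


g_star, p_alg = near_optimal_abstain_strategy([1.0, -0.5, 0.25, 0.0], 0.3, alpha=0.2)
```

In this toy example the predictor never abstains on the two high-margin examples, abstains half the time on the third, and always abstains on the example where the ensemble vote is evenly split. Note that the value of `alpha` only matters through the comparison with $\frac{1}{2}$.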


We remark that $p^{a,\mathrm{alg}}$ is nearly minimax optimal in certain situations. Specifically, its worst-case loss
can be quite close to the ideal $V_{\mathrm{abst}} \approx \alpha \left( 1 - \frac{w}{n} \right)$, which is defined in Theorem 5. In fact, $v \geq w$, so for
$p^{a,\mathrm{alg}}$ to be nearly minimax optimal, it suffices if the term $\left( \frac{1}{2} - \alpha \right) \frac{1}{n} \sum_{i > v} \frac{|a_i|}{|a_v|}$ is low. Loosely speaking,
this occurs when $\alpha \approx \frac{1}{2}$ or there is much disagreement among the ensemble, or when $\lambda$ (therefore $v$) is high.

The latter is typical when $\mathrm{err}(h^*)$ is low and the PAC-Bayes transduction algorithm is run; so in such
cases, the simple strategy $p^{a,\mathrm{alg}}$ is almost minimax optimal. Such a low-error case has been analyzed before
in the abstaining setting, notably by m, who extrapolate from the much deeper theoretical understanding 
of the realizable case (when err {h*) = 0 ). 

We have argued that $p^{a,\mathrm{alg}}$ achieves a worst-case loss arbitrarily close to $V_{\mathrm{abst}}$ in the limit $\lambda \to 1$,
regardless of $\alpha$. So we too are extrapolating somewhat from the realizable case with the approximately
optimal $p^{a,\mathrm{alg}}$, though in a different way from prior work.


Benefit of Abstention. Theorem 6 clearly illustrates the benefit of abstention when $p^{a,\mathrm{alg}}$ is played. To see
this, define V as in @ to be the value of the binary classification game without abstention. Then the worst- 
case classification error of the minimax optimal rule without abstention is ^ (1 — P) < | (l — := 

An upper bound on the worst-case loss incurred by the abstaining predictor that plays (g*,p“’“^) is given 


1 

2 ’ I 


>) (1- 


kii 

|a„| 


Ln — La is positive 


and can be rewritten as := 5 (l - ^ J2i>v {h ~ ^) 

illustrating the benefit of abstention. There is no benefit if $\alpha \geq \frac{1}{2}$, and increasing benefit the
further below $\frac{1}{2}$ it gets. The $\alpha \geq \frac{1}{2}$ result is to be expected (even the trivial strategy of predicting $g = 0$ is
preferable to abstention if $\alpha \geq \frac{1}{2}$) and echoes long-known results for this cost model in various settings.


Cost Model. The linear cost model we use, with one cost parameter $\alpha$, is prevalent in the literature,
as its simplicity allows for tractable optimality analyses in various scenarios. Many of these
results, however, assume that the conditional label probabilities are known or satisfy low-noise conditions
[14]. Others explicitly use the value of $\alpha$, which is an obstacle to practical use because $\alpha$ is often
unknown or difficult to compute. Our near-optimal prediction rule sidesteps this problem, because the
strategy $(g^*, p^{a,\mathrm{alg}})$ is independent of $\alpha$ in the nontrivial case $\alpha < \frac{1}{2}$. To our knowledge, this is unique in
the literature, and is a major motivation for our choice of $p^{a,\mathrm{alg}}$.


5.3 Guarantees for a Learning Algorithm with Abstentions 

The near-optimal abstaining rule of Section 5.2 can be bootstrapped into a PAC-Bayes prediction algorithm
that can abstain, exactly analogous to Section 4.1 for the non-abstaining case. Similarly to that analysis,
PAC-style results can be stated for the abstaining algorithm for $\alpha < \frac{1}{2}$ (calculations in Appendix B), using
$q$, $q_0$, $\epsilon(\cdot)$ defined in Section 4 and any $\delta \in (0,1)$. The algorithm:


abstains w.p. 

< 2 Eq [errs (h)] 2 e(m, q, qo, 5) -h 5 ^ 

and errs (f^T) w.p. 


2>U 


< Eq [e?is (h)] -f e(TO, q, qo, <5) -h <5 - — ^ (1 - |a,|) 


^As written here, Ln — La = ^ (§ ~ ct) ~ I?|) ~ ~'in dispensed with by lowering Ln 

using a slightly more careful analysis. 





Thus, by the same arguments as in the discussion of Section 4.1, the abstain and mistake probabilities are
respectively $\leq 2\,\mathrm{err}(h^*)$ and $\leq \mathrm{err}(h^*)$ for sufficiently large $m$. Both are sharper than corresponding results
of Freund et al. [4], whose work is in a similar spirit.

Their setup is similar to our ensemble setting for abstention, and they choose $q$ to be an exponential-weights
distribution over $\mathcal{H}$. In their work [4], the decision to abstain on the $i$th example is a deterministic
function of $|a_i|$ only (ours is a stochastic function of the full vector $a$), and any non-$\perp$ predictions are made
deterministically with the majority vote (ours can be stochastic). Predicting stochastically and exploiting
transduction lead to our mistake probability being essentially optimal (Footnote 4), as opposed to the
$\approx 2\,\mathrm{err}(h^*)$ of Freund et al. [4] caused by averaging effects. Our abstain probability also compares favorably
to the $\approx 5\,\mathrm{err}(h^*)$ in [4].


5.4 Derivation of Theorem 6

Define another ordering of the examples here:

Ordering 2. Order the examples so that $\frac{|a_1|}{1 - p_1^a} \geq \frac{|a_2|}{1 - p_2^a} \geq \dots \geq \frac{|a_n|}{1 - p_n^a}$. □


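We motivate and analyze the near-optimal strategy by considering the primal abstain game (9). Note
that the inner max-min of (9) is very similar to the no-abstain case of Section 3. So we can solve it similarly
by taking advantage of minimax duality, just as previously in Lemma 1.

Lemma 7. Using Ordering 2 of the examples w.r.t. any fixed $p^a$, define $v_2 = \min \left\{ i \in [n] : \frac{1}{n} \sum_{j=1}^{i} |a_j| \geq \lambda \right\}$.
Then

$$\max_{g \in [-1,1]^n} \; \min_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \frac{1}{n} \sum_{i=1}^{n} z_i (1 - p_i^a) g_i = \frac{1}{n} \sum_{i=1}^{v_2 - 1} (1 - p_i^a) + \frac{1 - p_{v_2}^a}{|a_{v_2}|} \left( \lambda - \frac{1}{n} \sum_{i=1}^{v_2 - 1} |a_i| \right).$$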

Substituting Lemma 7 into (9) still leaves a minimization over $p^a$ in (9). Solving this minimization would
then lead to the minimax optimal strategy $(g^*, p^{a*})$ for the predictor.

We are unable to solve this minimization in closed form, because the choice of $p^a$ and Ordering 2 depend
on each other. However, we prove a useful property of the optimal solution here.

Lemma 8. Suppose $p^{a*}$ is the minimax optimal predictor's abstain strategy. Using Ordering 2 of the
examples w.r.t. $p^{a*}$, define $v_2$ as in Lemma 7. If $\alpha \geq \frac{1}{2}$, then $p^{a*} = 0$. If $\alpha < \frac{1}{2}$, then for any $i > v_2$, the
minimax optimal $p_i^{a*}$ must be set so that

$$\frac{|a_i|}{1 - p_i^{a*}} = \frac{|a_{v_2}|}{1 - p_{v_2}^{a*}}.$$

The near-optimal abstain rule we choose has the properties outlined by Lemma 8, but uses Ordering 1
of the examples, as the results of Theorem 2 use this ordering.

A convenient consequence is that when $p^{a,\mathrm{alg}}$ is played, Orderings 1 and 2 of the examples are effectively
the same.

^6 Hereafter we make two assumptions for simplicity. One is that there are no examples such that $p_i^a = 1$ exactly. The other
is to neglect the effect of ties and tiebreaking. These assumptions do not lose generality for our purposes, because coming
arbitrarily close to breaking them is acceptable.

6 Conclusion


We have presented an analysis of aggregating an ensemble for binary classification, using minimax worst-case
techniques to formulate it as a game and suggest an optimal prediction strategy $g^*$, and PAC-Bayes
analysis to derive statistical learning guarantees.

The transductive setting, in which we consider predicting on many test examples at once, is key to our 
analysis in this manuscript, as it enables us to formulate intuitively appealing and nontrivial strategies z*, g* 
for the game without further assumptions, by studying how the ensemble errors are allocated among test 
examples. We aim to explore such arguments further in future work. 
A Proofs

Proof of Lemma 1. It suffices to find the value of the dual game (3). Consider the inner optimization problem
faced by the predictor in this game (where $z$ is fixed and known to it):

$$\text{Find: } \max_{g} \frac{1}{n} z^\top g \qquad \text{Such that: } -\mathbf{1}^n \leq g \leq \mathbf{1}^n. \tag{10}$$

The contribution of the $i$th example to the payoff of (10) is $\frac{1}{n} g_i z_i$; to maximize this within the constraint
$-1 \leq g_i \leq 1$, it is clear that the predictor will set $g_i = \operatorname{sgn}(z_i)$. The predictor therefore plays $g = \operatorname{sgn}(z)$,
where $\operatorname{sgn}(\cdot)$ is taken componentwise.

With this $g$, the game (for the average performance constraint) reduces to

$$\text{Find: } \min_{z} \frac{1}{n} z^\top g = \min_{z} \frac{1}{n} \sum_{i=1}^{n} |z_i| \qquad \text{Such that: } \frac{1}{n} z^\top a \geq \lambda \text{ and } -\mathbf{1}^n \leq z \leq \mathbf{1}^n. \tag{11}$$

For any $i \in [n]$, changing the value of $z_i$ from 0 to $\epsilon$ raises the payoff by $\frac{1}{n} |\epsilon|$, and can raise $\frac{1}{n} z^\top a$ by at most
$\frac{1}{n} |a_i \epsilon|$. Thus, the data examples which allow nature to progress most towards satisfying the performance
constraint are those with the highest $|a_i|$, for which nature should set $z_i = \pm 1$ to extract every advantage
(avoid leaving slack in the hypercube constraint). This argument holds inductively for the first $v - 1$ examples
before the constraint $\frac{1}{n} z^\top a \geq \lambda$ is satisfied; the $v$th example can have $|z_v| < 1$ from boundary effects, and
for $i > v$, nature would set $z_i = 0$ to minimize the payoff. (All this can also be shown by checking the KKT
conditions.)

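Consequently, the $z$ that solves (11) can be defined by a sequential greedy procedure:

1. Initialize $z = 0$, "working set" of examples $S = \emptyset$.

2. Select an $i \in \arg\max_{j \in [n] \setminus S} |a_j|$ and set $S = S \cup \{i\}$.

3. If $\frac{1}{n} \sum_{j \in S} |a_j| < \lambda$: set $z_i = \operatorname{sgn}(a_i)$ and go back to step 2.

4. Else: set $z_i = \operatorname{sgn}(a_i) - \frac{1}{a_i} \left( \sum_{j \in S} |a_j| - n\lambda \right)$ and terminate, returning $z$.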
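The greedy procedure translates almost line by line into code. This is a sketch under the assumption that the constraint is feasible, i.e. $0 < \lambda \leq \frac{1}{n} \sum_i |a_i|$; the function name is hypothetical, and ties in step 2 are broken by taking the first maximizer.

```python
def greedy_nature(a, lam):
    """Build nature's solution to problem (11) by the greedy procedure."""
    n = len(a)
    if not 0 < lam <= sum(abs(x) for x in a) / n:
        raise ValueError("constraint (1/n) z^T a >= lam is infeasible")
    z = [0.0] * n                  # step 1: z = 0, empty working set
    S = set()
    covered = 0.0                  # running value of sum_{j in S} |a_j|
    while len(S) < n:
        # Step 2: pick the remaining example with the largest |a_j|.
        i = max((j for j in range(n) if j not in S), key=lambda j: abs(a[j]))
        S.add(i)
        covered += abs(a[i])
        sign = float((a[i] > 0) - (a[i] < 0))
        if covered < n * lam:      # step 3: constraint not yet met
            z[i] = sign
        else:                      # step 4: boundary example, then stop
            z[i] = sign - (covered - n * lam) / a[i]
            break
    return z


z_greedy = greedy_nature([1.0, -0.5, 0.25, 0.0], 0.3)
```

For this input the first example receives $z_1 = 1$, the second is the boundary example with $z_2 = -0.4$, and the remaining entries stay at zero, so the constraint $\frac{1}{n} z^\top a \geq \lambda$ holds with equality.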


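Call the vector set by this procedure $\tilde{z}$. Then under the constraints of (11),

$$\min_{z} \max_{g} \frac{1}{n} z^\top g = \max_{g} \frac{1}{n} \tilde{z}^\top g = \frac{1}{n} \sum_{i=1}^{n} |\tilde{z}_i| = \frac{v-1}{n} + \frac{1}{n} \left| \operatorname{sgn}(a_v) - \frac{1}{a_v} \left( \sum_{i=1}^{v} |a_i| - n\lambda \right) \right| = \frac{v-1}{n} + \frac{1}{|a_v|} \left( \lambda - \frac{1}{n} \sum_{i=1}^{v-1} |a_i| \right) = V,$$

as desired. ■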


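Proof of Theorem 2. We have already considered the dual game in Lemma 1, from which it is clear that if
$z^*$ is played, then regardless of $g$, $\frac{1}{n} z^{*\top} g \leq \frac{v-1}{n} + \frac{1}{|a_v|} \left( \lambda - \frac{1}{n} \sum_{i=1}^{v-1} |a_i| \right) = V$, and therefore $z^*$ is minimax
optimal.

Now it suffices to prove that the predictor can force a correlation of $\geq V$ by playing $g^*$ in the primal
game, where it plays first. After $g^*$ is played, nature is faced with the following problem:

$$\text{Find: } \min_{z} \frac{1}{n} z^\top g^* \qquad \text{Such that: } \frac{1}{n} z^\top a \geq \lambda \text{ and } -\mathbf{1}^n \leq z \leq \mathbf{1}^n. \tag{12}$$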
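Now since $\frac{1}{n} z^\top a \geq \lambda$, we have the inequality

$$\frac{1}{n} z^\top g^* = \frac{1}{n} \sum_{i=1}^{v-1} \operatorname{sgn}(a_i) z_i + \frac{1}{n |a_v|} \sum_{i=v}^{n} a_i z_i \geq \frac{1}{n} \sum_{i=1}^{v-1} \operatorname{sgn}(a_i) z_i + \frac{1}{|a_v|} \left( \lambda - \frac{1}{n} \sum_{i=1}^{v-1} a_i z_i \right) = \frac{1}{n} \sum_{i=1}^{v-1} \left( 1 - \frac{|a_i|}{|a_v|} \right) \operatorname{sgn}(a_i) z_i + \frac{\lambda}{|a_v|}. \tag{13}$$

For $i \in [1, v-1]$, $1 - \frac{|a_i|}{|a_v|} \leq 0$. So to minimize (13) (and solve (12)), nature sets $z_i$ so that $\operatorname{sgn}(a_i) z_i$
is maximized. For each $i \leq v - 1$, nature can force $\operatorname{sgn}(a_i) z_i = 1$ by setting $z_i = \operatorname{sgn}(a_i)$, and this is the
maximum possible: $\operatorname{sgn}(a_i) z_i \leq |z_i| \leq 1$.

From (13), we see that the values $\{z_i\}_{i \geq v}$ are irrelevant to the payoff, so any setting of $z$ that sets
$z_i = \operatorname{sgn}(a_i)$ for $i \leq v - 1$ (call such a setting $\tilde{z}$) will solve (12).

$$\min_{z} \frac{1}{n} z^\top g^* = \frac{1}{n} \tilde{z}^\top g^* = \frac{1}{n} \sum_{i=1}^{v-1} \left( 1 - \frac{|a_i|}{|a_v|} \right) + \frac{\lambda}{|a_v|} = V,$$

so we have argued that $g^*$ forces a correlation of $\geq V$, and therefore it is minimax optimal.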


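Proof of Theorem 5. Note that the abstain game (9) is linear in all three vectors $g, p^a, z$. All three constraint
sets are convex and compact, as well, so the minimax theorem can be invoked (see Prop. 9), yielding

$$\begin{aligned} V_{\mathrm{abst}} &= \min_{p^a \in [0,1]^n} \; \min_{g \in [-1,1]^n} \; \max_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \frac{1}{n} \sum_{i=1}^{n} \left[ p_i^a \alpha + \frac{1}{2} (1 - p_i^a)(1 - g_i z_i) \right] \\ &= \max_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \min_{p^a \in [0,1]^n} \; \min_{g \in [-1,1]^n} \; \frac{1}{n} \sum_{i=1}^{n} \left[ p_i^a \alpha + \frac{1}{2} (1 - p_i^a)(1 - g_i z_i) \right] \\ &= \max_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \min_{p^a \in [0,1]^n} \; \frac{1}{n} \sum_{i=1}^{n} \left[ p_i^a \alpha + \frac{1}{2} (1 - p_i^a)(1 - |z_i|) \right] \\ &= \max_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \frac{1}{n} \sum_{i=1}^{n} \min \left( \alpha, \frac{1}{2} (1 - |z_i|) \right) \end{aligned} \tag{14}$$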


From this point, the analysis is a variation on the proof of Lemma 1 from (11) onwards.

Consider nature's strategy when faced with (14). A trivial upper bound on (14) is $\alpha$ regardless of $z$, and
for any $i$ it is possible to set $z_i$ such that $|z_i| \leq 1 - 2\alpha$ without lowering (14) from $\alpha$. So nature can first
set $z = (1 - 2\alpha) \operatorname{sgn}(a)$, to progress most towards satisfying the $\frac{1}{n} z^\top a \geq \lambda$ constraint while maintaining the
value at $\alpha$.

If $z = (1 - 2\alpha) \operatorname{sgn}(a)$ meets the constraint $\frac{1}{n} z^\top a \geq \lambda$, i.e. $\frac{1 - 2\alpha}{n} \sum_{i=1}^{n} |a_i| \geq \lambda \iff \alpha \leq \frac{1}{2} \left( 1 - \frac{n\lambda}{\sum_{i=1}^{n} |a_i|} \right)$,
then the value is clearly $\alpha$.
Otherwise, i.e. if $c := \lambda - \frac{1 - 2\alpha}{n} \sum_{i=1}^{n} |a_i| > 0$, then nature must start with the setting $z' := (1 - 2\alpha) \operatorname{sgn}(a)$
and continue raising $|z_i|$ for some indices $i$ until the constraint $\frac{1}{n} z^\top a \geq \lambda$ is met. $c$ can be thought of as a
budget that nature must satisfy by starting from $z'$ and adjusting $\{z'_i\}_{i \in S}$ away from 0 for some subset $S$ of
the indices $[n]$.

For any $i$, raising $|z'_i|$ in this way by some small $\epsilon$ costs nature $\frac{\epsilon}{2n}$ in payoff and lowers the remaining
budget by $\frac{\epsilon |a_i|}{n}$. Therefore, to satisfy the budget with maximum payoff, the examples get $|z_i|$ set to 1 in
descending order of $|a_i|$ (Ordering 1, which we therefore use for the rest of this proof) until the remaining
budget to satisfy runs out.

This occurs on the $w$th example (using Ordering 1), where $w = \min \left\{ i \in [n] : \frac{2\alpha}{n} \sum_{j=1}^{i} |a_j| \geq c \right\}$, which
after a little algebra is equivalent to the statement of the theorem.

Substituting this into (14), we get

$$V_{\mathrm{abst}} = \frac{1}{n} \sum_{i=1}^{w-1} \min(\alpha, 0) + \frac{1}{2n} (1 - |z_w|) + \frac{1}{n} \sum_{i=w+1}^{n} \alpha = \alpha \left( 1 - \frac{w}{n} \right) + \frac{1}{2n} (1 - |z_w|). \tag{15}$$

Now by definition of $w$, $\frac{2\alpha}{n} \sum_{j=1}^{w-1} |a_j| < c \leq \frac{2\alpha}{n} \sum_{j=1}^{w} |a_j|$, so that $1 - 2\alpha < |z_w| \leq 1$. Using this to bound (15) gives the result. ■

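Proof of Lemma 7. Just as with the non-abstaining game in Lemma 1, it is clear that minimax duality
applies to this game. Therefore, we can find the value of the dual game instead, which is done directly using
the same reasoning as in the proof of Lemma 1:

$$\begin{aligned} \max_{g \in [-1,1]^n} \; \min_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \frac{1}{n} \sum_{i=1}^{n} z_i (1 - p_i^a) g_i &= \min_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \max_{g \in [-1,1]^n} \; \frac{1}{n} \sum_{i=1}^{n} z_i (1 - p_i^a) g_i \\ &= \min_{\substack{z \in [-1,1]^n, \\ \frac{1}{n} z^\top a \geq \lambda}} \; \frac{1}{n} \sum_{i=1}^{n} (1 - p_i^a) |z_i| \\ &= \frac{1}{n} \sum_{i=1}^{v_2 - 1} (1 - p_i^a) + \frac{1 - p_{v_2}^a}{|a_{v_2}|} \left( \lambda - \frac{1}{n} \sum_{i=1}^{v_2 - 1} |a_i| \right). \end{aligned}$$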

Proof of Lemma 8. Define $v_2$ as in Lemma 7; throughout this proof, we use Ordering 2 of the examples.
Substituting Lemma 7 into (9), the value of the game is


Vabst = X + - min 
2 n p“G[o,i]^ 


E 


aPi 


n 


E 

2=112 + 1 


a-- 


(16) 


It only remains to (approximately) solve the minimization in (16), keeping in mind that $v_2$ depends on
$p^a$ because of the ordering of the coordinates (Ordering 2).

If $\alpha \geq \frac{1}{2}$, then regardless of $i$, neither sum of (16) can decrease with increasing $p_i^a$. In this case, the
minimizer $p^{a*} = 0$, identically zero for all examples.

If $\alpha < \frac{1}{2}$, consider an example $x_i$ for some $i > v_2$. We prove that $\frac{|a_{v_2}|}{1 - p_{v_2}^{a*}} = \frac{|a_i|}{1 - p_i^{a*}}$. If this is not true,
i.e. $\frac{|a_{v_2}|}{1 - p_{v_2}^{a*}} > \frac{|a_i|}{1 - p_i^{a*}}$, then $p_i^{a*}$ can be raised while keeping $x_i$ out of the top $v_2$ examples. This would decrease
the second sum of (16) because $\alpha - \frac{1}{2} < 0$, which contradicts the assumption that $p_i^{a*}$ is optimal. ■


Proof of Theorem 6. Under the optimal strategy $p^{a*}$, Orderings 1 and 2 are identical for our purposes, so
$v_2 = v$. Revisiting the argument of Lemma 7 with this information, the predictor's and nature's strategies
$g^*$ and $z^*$ are identical to their minimax optimal strategies in the non-abstaining case. Substituting $p^{a,\mathrm{alg}}$
into (16) gives the worst-case loss after simplification. ■


B PAC-Style Results for the Abstain Game 


This appendix contains the calculations used to prove the results of Section 5.3.
If the algorithm abstains at all ($\alpha < \frac{1}{2}$), the overall probability that it does so is




2=1 


i^v 


|ad 

\a,,\ 


<i_ly M 


i<.v 


71 


n ^ \av 
2 > 1 ? 


Using a union bound with (6) gives the result on the abstain probability.

When $\alpha < \frac{1}{2}$, the probability that the predictor predicts an incorrect label ($\neq \perp$) under minimax optimal
play depends on nature's play $z$. It can be calculated as (w.p. $\geq 1 - \delta$):


1 ” 

2=1 


(17) 


1 

2n 


2n ^ lot, 
2>2; 


. i=l 


y min ( 1,) - y ZiSgn(ai) - y ZiSgn(ai)^ 


2<2; 


2>U 


(18) 


Under the constraints $-\mathbf{1}^n \leq z \leq \mathbf{1}^n$ and $\frac{1}{n} z^\top a \geq \lambda$, the maximizer of (18) w.r.t. $z$ is indeed $z^*$, so the
chance of predicting incorrectly is


Pmc(z) ^ Pinci^ ) — 


i=l 

n — V 1 
2n 2 


(’-p 


2=1 


Using a union bound with (6) gives the result on the probability of an incorrect non-abstention.

C Application of the Minimax Theorem

Proposition 9. Let $R = \{v \in \mathbb{R}^n : -\mathbf{1}^n \leq v \leq \mathbf{1}^n\}$ and $A = \{z \in \mathbb{R}^n : \frac{1}{n} z^\top a \geq \lambda\}$. Then

$$\max_{g \in R} \; \min_{z \in R \cap A} \; \frac{1}{n} z^\top g = \min_{z \in R \cap A} \; \max_{g \in R} \; \frac{1}{n} z^\top g.$$

Proof. $R$ is convex and compact, and $A$ is a closed half-space and hence convex, so $R \cap A$ is convex and compact. The payoff function is linear in $z$ and $g$.

Therefore the minimax theorem (e.g. m Theorem 7.1) applies, giving the result. ■ 